Process cohort table, metadata, filtering

Main figures

Cohort last updated: 2020-10-29 => 2021-04-02 patient exclusions => 2021-07-26 diagnoses updates and patient exclusions

Diagnoses

For fun: Adding the additional patients with TSFs:

previous plots with validation and discovery separated (hidden)

Validation figure comparing Fusion-sq to FFPM >0.1

Key message: clinically relevant fusions (color) can be detected even when lowly expressed (dots below the line). Filtering on FFPM > 0.1 is not a good strategy for identifying drivers, because many other fusion predictions also pass that filter (grey jittered points above the line). In addition, the figure shows the specificity of our approach: only a few additional high-confidence tumor-specific fusions are identified (black dots).
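A minimal sketch of the comparison behind this key message, contrasting an FFPM > 0.1 cutoff with a selection based on WGS support. The data frame, its column names (`ffpm`, `clin_rel`, `hc_tumor_specific`) and all values are made up for illustration; they are not cohort data and not necessarily the names used in the analysis.

```r
# Toy fusion predictions; all names and values are hypothetical.
fusions <- data.frame(
  fusion_name       = c("A--B", "C--D", "E--F", "G--H", "I--J"),
  ffpm              = c(2.30, 0.05, 0.40, 0.80, 0.02),
  clin_rel          = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  hc_tumor_specific = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors  = FALSE
)

# Strategy 1: expression cutoff only.
pass_ffpm <- fusions$ffpm > 0.1

# Strategy 2: high-confidence tumor-specific (WGS-supported) predictions.
pass_wgs <- fusions$hc_tumor_specific

# Clinically relevant fusions missed by the expression cutoff.
missed_by_ffpm <- fusions$fusion_name[fusions$clin_rel & !pass_ffpm]

# Other predictions that remain after the expression cutoff.
extra_after_ffpm <- fusions$fusion_name[!fusions$clin_rel & pass_ffpm]

print(missed_by_ffpm)
print(extra_after_ffpm)
```

In this toy example the expression cutoff drops one clinically relevant fusion and keeps two unrelated predictions, which is the pattern the figure is meant to show.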

Note: Reciprocal fusions removed for display purposes. With jittered points for all predictions.

Note: “SS18–SSX1” == “AC091021.1–SSX1”

Note: –MYC reciprocal only for PMCID340AAO

Previous simpler versions (hidden)

Validation: RNA-DNA distance and tool concordance

Fusions and SVs can be matched at a wide range of distances between the RNA and DNA breakpoints, because matching uses the intron-exon gene structure (upstream and downstream gene on the x and y axis, respectively). In comparison, even a large fixed distance of 10 kb would not be sufficient in all cases (red lines). In addition, all clinically relevant fusions are detected by all three tools at nucleotide resolution at the precise intervals (adj intron/flank/sj), except for ASPSCR1–TFE3, which requires composite for Manta and shows slight differences between the tools.
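To illustrate why a fixed distance cutoff can miss matches that gene structure recovers, here is a small sketch under simplifying assumptions. The coordinates, the single adjacent intron, and the helper names `match_by_distance` and `match_by_interval` are hypothetical and do not reproduce the actual matching logic.

```r
# Hypothetical coordinates on one chromosome; not taken from the cohort.
rna_breakpoint  <- 150000                           # exon boundary from the RNA fusion call
adjacent_intron <- c(start = 150001, end = 185000)  # intron next to that exon
sv_breakpoints  <- c(152500, 171000, 184900, 190000)

# Fixed-distance matching: SV breakpoint within max_dist of the RNA breakpoint.
match_by_distance <- function(sv, rna, max_dist = 10000) {
  abs(sv - rna) <= max_dist
}

# Structure-aware matching: SV breakpoint anywhere inside the adjacent intron.
match_by_interval <- function(sv, interval) {
  sv >= interval[["start"]] & sv <= interval[["end"]]
}

by_distance <- match_by_distance(sv_breakpoints, rna_breakpoint)
by_interval <- match_by_interval(sv_breakpoints, adjacent_intron)

print(data.frame(sv_breakpoints, by_distance, by_interval))
```

With these toy values, two SV breakpoints lie more than 10 kb from the RNA breakpoint but still fall inside the adjacent intron, so only the interval-based rule matches them.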

See other markdown for numbers on precision.

=> TODO: update with overlaps and bp distances. Figure no longer correct

2021-05-26: validation set = clinically relevant fusion in the right orientation, unless no other is available. Still needed: per tool, the DNA-RNA distance of the selected high-confidence SV.

Tumor-specific SVs per patient

Key message: few fusion predictions per patient have WGS support, and these are patient-specific.

Tumor-specific fusions in discovery set only

Fusions per patient and genomic instability

Percentiles for gene fusion burden

## [1] "somatic_hc cnt"
##   50%   90%   95%   99% 
##  0.00  3.00  5.55 12.84 
## [1] "Used in the paper: somatic_low_af_hc cnt ,"
##   50%   90%   95%   99% 
##  0.50  3.00  7.00 16.13 
## [1] "any somatic_low_af cnt (includes low confidence),"
##   50%   90%   95%   99% 
##  1.00  4.00  7.00 17.55
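Percentile tables of this shape can be produced with base R's `quantile()`. The sketch below uses made-up per-patient counts; the vector `fusion_cnt` and the choice of the 95th percentile as a "high burden" cutoff are assumptions for illustration only.

```r
# Hypothetical per-patient counts of high-confidence fusions.
fusion_cnt <- c(0, 0, 0, 1, 1, 2, 3, 5, 8, 20)

probs <- c(0.50, 0.90, 0.95, 0.99)
burden_percentiles <- quantile(fusion_cnt, probs = probs)
print(burden_percentiles)

# Patients at or above the 95th percentile (one possible "high burden" cutoff).
high_burden <- fusion_cnt >= burden_percentiles[["95%"]]
print(which(high_burden))
```

`quantile()` interpolates between observations by default (type 7), which is why percentiles of integer counts can be non-integer.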

Fraction of genome altered distribution and median in red

Plots relating FGA, somatic/low-af fusion burden (high confidence) and CNA

per patient find patterns fga/cna/fusion burden

Combine predicted and high conf in 1 figure

## [1] TRUE

Predicted ffpm >0.1

## [1] 2.55

With all predicted fusions

## [1] 22.6

SV type analysis

Everything here is for high-confidence fusions, counting either every patient-fusion once or every unique (uq) fusion once. => For the paper, look at unique gene pairs.

Unique gene pairs

Exclude ambiguous fusions from the figures/analysis, and set the KIAA1549–BRAF svtype label to "DUP" instead of "complex". It was labelled complex because of "DUP, INV" (that one patient also had l2fc), which breaks the figures:

uq_gene_pairs_hc[uq_gene_pairs_hc$fusion_name == "KIAA1549–BRAF", c("svtype_label")] = "DUP"

Patient-fusion

SV types

Allele frequency

SV size

Annotation

Gene expression, exon imbalance and CNAs

Based on selected fusion candidates (see supplementary)

FPKM plots pan cancer, group, domain

Subset to onco/tsg, somatic only (no low af)

Check which ones have no functional effect annotated yet. Annotate as ‘anno has kinase’, remove from label, and mark fusions with an asterisk if the patient is genomically unstable.

## [1] "tsg,oncogene" NA             "tsg"          "oncogene"

For manuscript

Keep display fpkm above

For presentation

Expression of clin-rel fusions

Expression in high/low fusion burden patients

domain is based on primary_group_label

ONLY patients with high fusion burden

Presentation figure of fold change without the log for PMCID203AAL only: low-AF as well.

Without patients with high fusion burden

All cancer related that are not yet categorised

Without patients with high fusion burden, only categorised ones

Somatic onco tsg and chimera without assigned category

How do CNAs and expression relate?

What reference do we use for expression? Pan cancer vs group vs domain

Compare the different z scores, split by gene label, upstream/downstream and also fusion positive patient or not.

The gating with red lines shows what we would lose or gain by using domain, pan-cancer or group as the reference. It looks like using domain/domain-assigned is just fine!

Since domain-assigned is almost the same, don't bother with the re-annotation!
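A minimal sketch of how z-scores against different reference sets can be computed side by side, assuming a long table with one log-FPKM value per patient for a single gene. The column names mirror those in the test output further down (`domain_label`, `fpkm_log`), but the data, the helper `zscore` and the columns `z_pan` and `z_domain` are hypothetical.

```r
# Toy log-FPKM values for one gene; all values and labels are hypothetical.
expr <- data.frame(
  patient_id   = paste0("P", 1:8),
  domain_label = c("solid", "solid", "solid", "solid",
                   "neuro", "neuro", "hemato", "hemato"),
  fpkm_log     = c(2.1, 1.9, 2.4, 0.6, 3.0, 3.2, 1.0, 1.2),
  stringsAsFactors = FALSE
)

# z-score of each value against a reference set of values.
zscore <- function(x, ref) (x - mean(ref)) / sd(ref)

# Pan-cancer reference: all patients.
expr$z_pan <- zscore(expr$fpkm_log, expr$fpkm_log)

# Domain reference: only patients sharing the same domain label.
expr$z_domain <- ave(expr$fpkm_log, expr$domain_label,
                     FUN = function(v) zscore(v, v))

print(expr)
```

Plotting `z_pan` against `z_domain` and gating at a chosen threshold would show which patients change call depending on the reference.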

Systematic expression candidate fusions

Candidate fusions in patients with stable genome

MTAP-CDKN2B-AS1

The only recurrent somatic fusion not identified as clinically relevant.

Account for the effect of CNAs in that region on the patients' CDKN2A expression.

TP53 mutations

Wilcoxon rank-sum test comparing TP53-fused patients against the rest of the patients in the domain (solids)

## [1] "TP53 mean fpkm log of patients"
## [1] 0.879911
## [1] "mean fpkm of domain, excl TP53- fusion patients"
## [1] 2.003468
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  filter(gene_expression_analysis, ensembl_id %in% candidate$gup_ensembl_id & patient_id %in% candidate$patient_id)$fpkm_log and filter(gene_expression_analysis, ensembl_id %in% candidate$gup_ensembl_id & !patient_id %in% candidate$patient_id & domain_label %in% candidate$domain_label)$fpkm_log
## W = 15, p-value = 0.01709
## alternative hypothesis: true location shift is not equal to 0
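For reference, a self-contained sketch of this kind of two-group comparison with base R's `wilcox.test()`. The two vectors are invented values, not the cohort's expression data, and the variable names are hypothetical.

```r
# Hypothetical log-FPKM values; not the cohort data.
fused_patients <- c(0.5, 0.9, 1.1)
other_patients <- c(1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.5)

# Two-sided Wilcoxon rank-sum (Mann-Whitney) test.
res <- wilcox.test(fused_patients, other_patients, alternative = "two.sided")

print(mean(fused_patients))
print(mean(other_patients))
print(res$p.value)
```

With small samples and no ties, R computes an exact p-value, so the printed header differs from the "with continuity correction" variant shown above, which is used for larger samples or when ties are present.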

Same for full cohort

## [1] 2.004524
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  filter(gene_expression_analysis, ensembl_id %in% candidate$gup_ensembl_id & patient_id %in% candidate$patient_id)$fpkm_log and filter(gene_expression_analysis, ensembl_id %in% candidate$gup_ensembl_id & !patient_id %in% candidate$patient_id)$fpkm_log
## W = 39, p-value = 0.01921
## alternative hypothesis: true location shift is not equal to 0

LSAMP

RB1

PMCID308AAK: no significant reduction in expression, but shows the fusion and a -0.36 CN deletion. Maybe because RB1 is often deleted/affected?

MEF2B

Renal tumor candidates

HOXA9

Supplementary

Patient flow diagram

Fusion detection flow diagram

First on total cohort -> more is ambiguous -> then load uq fusions df

Precise confident vs not precise confident

Supplementary Figure 1: Main flow diagram. Healthy chimera and cancer databases are not included here because these are overlapping sets, so it is not clear how to indicate that in a flow diagram.

## [1] "High conf (med: 1 mean: 1.8)"

Unique gene pairs

Support by multiple tools

How many tools support one fusion? NB: not all patients have 3 tools

##    patient_id manta_present delly_present gridss_present patient_label
## 1 PMCID143AAM          TRUE          TRUE          FALSE       M863AAC
## 2 PMCID332AAA          TRUE          TRUE          FALSE       M479AAA
## 3 PMCID730AAJ          TRUE          TRUE          FALSE       M156AAA
## 4 PMCID451AAM          TRUE          TRUE          FALSE       M606AAA
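Because some patients lack output from one tool, tool support is easier to interpret next to the number of tools that were actually run. A minimal sketch, assuming one logical column per caller with `NA` where the tool was not run; the table and the columns `tools_available` and `tools_supporting` are hypothetical.

```r
# Hypothetical per-fusion SV support flags from the three callers.
support <- data.frame(
  fusion_name = c("A--B", "C--D", "E--F"),
  manta  = c(TRUE, TRUE, FALSE),
  delly  = c(TRUE, FALSE, FALSE),
  gridss = c(TRUE, NA, TRUE),   # NA: tool was not run for this patient
  stringsAsFactors = FALSE
)

tool_cols <- c("manta", "delly", "gridss")

# Number of tools that were run, and number that support the fusion.
support$tools_available  <- rowSums(!is.na(support[, tool_cols]))
support$tools_supporting <- rowSums(support[, tool_cols], na.rm = TRUE)

print(support)
```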

Classified as tumor-specific by AF overall

Only patients with all 3 tools

Detected as somatic by that tool (based on AF)

Classified as tumor-specific by AF overall

Tumor-normal AF scatter plot

High confidence

For figures: schematic for classification. Line based on: (t-n)/n > 1.5, which is equivalent to t/n > 2.5
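The threshold line can be expressed as a simple rule on tumor (t) and normal (n) allele fractions. The sketch below only illustrates that line; the helper `is_tumor_specific`, the example values, and the idea that this rule alone decides the class are assumptions.

```r
# Hypothetical tumor and normal allele fractions for five SVs.
tumor_af  <- c(0.45, 0.30, 0.10, 0.50, 0.20)
normal_af <- c(0.00, 0.10, 0.08, 0.25, 0.05)

# (t - n) / n > 1.5 rewritten as t > 2.5 * n, which avoids dividing by zero
# when the variant is absent from the normal sample.
is_tumor_specific <- function(t, n, min_ratio = 2.5) {
  t > min_ratio * n
}

tumor_specific <- is_tumor_specific(tumor_af, normal_af)
print(data.frame(tumor_af, normal_af, tumor_specific))
```

Writing the rule as a product rather than a ratio gives the same result for n > 0 and treats variants with n = 0 and t > 0 as above the line.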

Pairwise overlap and distance plots

High confidence

CTX and copy number

## [1] 73
## [1] 65

There is only one germline CTX

How can low-AF variants be high confidence? / CN distribution of variant classes